The second data product for my university course. Its aim is to predict housing prices using regression.
The data is the Ames Housing dataset from kaggle.com. It is split into two files, train.csv and test.csv.
Created by: Dobosi Péter MW79ON

Business Understanding:

First, let’s decide what exactly the problem is and what we want to achieve. The full overview of the problem can be read here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview

In summary, we have to create a model that can predict the sale price of a house from its parameters, using regression.

My plan is the following:

  1. Explore the data.
  2. Clean and transform it into a usable format.
  3. Choose, build and verify a model.
  4. Evaluate it.
  5. Use it to predict on the final dataset.

First things first, let’s load the data and take a quick look at it:

data <- read.csv("./train.csv")
data

We have 1460 observations and 81 columns: an Id, 79 explanatory parameters, and the SalePrice target. Some values are NA, and every column is either a character or an integer vector.
To prepare the data properly, we have to handle the missing values one way or another and convert the character columns to factors.

It seems like we have a couple of parameters that are mostly NAs:

# Divide by nrow(data) instead of a hard-coded row count.
alley_na_percentage <- sum(is.na(data$Alley)) / nrow(data) * 100
pool_na_percentage <- sum(is.na(data$PoolQC)) / nrow(data) * 100
fence_na_percentage <- sum(is.na(data$Fence)) / nrow(data) * 100
feature_na_percentage <- sum(is.na(data$MiscFeature)) / nrow(data) * 100

paste("Percentage of NAs in the field Alley:", alley_na_percentage, "%")
[1] "Percentage of NAs in the field Alley: 93.7671232876712 %"
paste("Percentage of NAs in the field PoolQC:", pool_na_percentage, "%")
[1] "Percentage of NAs in the field PoolQC: 99.5205479452055 %"
paste("Percentage of NAs in the field Fence:", fence_na_percentage, "%")
[1] "Percentage of NAs in the field Fence: 80.7534246575342 %"
paste("Percentage of NAs in the field MiscFeature:", feature_na_percentage, "%")
[1] "Percentage of NAs in the field MiscFeature: 96.3013698630137 %"
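
The same percentages can also be computed for every column in one go. Below is a minimal sketch on a small made-up data frame (the `toy` data frame and its values are assumptions for illustration, not rows from train.csv). It relies on `colMeans(is.na(...))`, which gives the fraction of missing values per column.

```r
# Toy stand-in for the training data (hypothetical values, not from train.csv).
toy <- data.frame(
  Alley   = c(NA, NA, "Grvl", NA),
  PoolQC  = c(NA, NA, NA, NA),
  LotArea = c(8450, 9600, 11250, 9550)
)

# is.na() on a data frame returns a logical matrix; the column means of that
# matrix are the fractions of missing values per column.
na_percentage <- colMeans(is.na(toy)) * 100

# Sort so that the columns with the most NAs come first.
na_percentage <- sort(na_percentage, decreasing = TRUE)
print(na_percentage)
```

On the real data, replacing `toy` with `data` should list all columns ranked by their NA share.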

With the help of data_description.txt we can decipher what these mean:
(Only the relevant parts are shown here; read the rest in the file if you are interested.)

Alley:
Type of alley access to property.

   Grvl Gravel
   Pave Paved
   NA   No alley access

PoolQC:
Pool quality.

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   NA   No Pool

Fence:
Fence quality.

   GdPrv    Good Privacy
   MnPrv    Minimum Privacy
   GdWo     Good Wood
   MnWw     Minimum Wood/Wire
   NA       No Fence

MiscFeature:
Miscellaneous feature not covered in other categories.

   Elev Elevator
   Gar2 2nd Garage (if not described in garage section)
   Othr Other
   Shed Shed (over 100 SF)
   TenC Tennis Court
   NA   None

As we can see, these NAs don’t mean that we have no information on those parameters of the buildings. Rather, they mean that the property simply lacks the thing described by the parameter. This is important information, so we can’t just drop these values or guess them based on the others.

Exploratory Data Analysis:

Let’s find out more about the data’s characteristics!
Let’s take a quick look at the parameters of the dataset:

names(data)
 [1] "Id"            "MSSubClass"    "MSZoning"      "LotFrontage"   "LotArea"       "Street"        "Alley"        
 [8] "LotShape"      "LandContour"   "Utilities"     "LotConfig"     "LandSlope"     "Neighborhood"  "Condition1"   
[15] "Condition2"    "BldgType"      "HouseStyle"    "OverallQual"   "OverallCond"   "YearBuilt"     "YearRemodAdd" 
[22] "RoofStyle"     "RoofMatl"      "Exterior1st"   "Exterior2nd"   "MasVnrType"    "MasVnrArea"    "ExterQual"    
[29] "ExterCond"     "Foundation"    "BsmtQual"      "BsmtCond"      "BsmtExposure"  "BsmtFinType1"  "BsmtFinSF1"   
[36] "BsmtFinType2"  "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"   "Heating"       "HeatingQC"     "CentralAir"   
[43] "Electrical"    "X1stFlrSF"     "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"     "BsmtFullBath"  "BsmtHalfBath" 
[50] "FullBath"      "HalfBath"      "BedroomAbvGr"  "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd"  "Functional"   
[57] "Fireplaces"    "FireplaceQu"   "GarageType"    "GarageYrBlt"   "GarageFinish"  "GarageCars"    "GarageArea"   
[64] "GarageQual"    "GarageCond"    "PavedDrive"    "WoodDeckSF"    "OpenPorchSF"   "EnclosedPorch" "X3SsnPorch"   
[71] "ScreenPorch"   "PoolArea"      "PoolQC"        "Fence"         "MiscFeature"   "MiscVal"       "MoSold"       
[78] "YrSold"        "SaleType"      "SaleCondition" "SalePrice"    

As we can see, there are 81 columns: the Id, 79 explanatory parameters, and the target, SalePrice.
Let’s take a look at the types of the parameters:

str(data)
'data.frame':   1460 obs. of  81 variables:
 $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
 $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
 $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
 $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
 $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
 $ Alley        : chr  NA NA NA NA ...
 $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
 $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
 $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
 $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
 $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
 $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
 $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
 $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
 $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
 $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
 $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
 $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
 $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
 $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
 $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
 $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
 $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
 $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
 $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
 $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
 $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
 $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
 $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
 $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
 $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
 $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
 $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
 $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
 $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
 $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
 $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
 $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
 $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
 $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
 $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
 $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
 $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
 $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
 $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
 $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
 $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
 $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
 $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
 $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
 $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
 $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
 $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
 $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
 $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
 $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
 $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
 $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
 $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
 $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
 $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
 $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
 $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
 $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
 $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
 $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
 $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
 $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
 $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolQC       : chr  NA NA NA NA ...
 $ Fence        : chr  NA NA NA NA ...
 $ MiscFeature  : chr  NA NA NA NA ...
 $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
 $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
 $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
 $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
 $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
 $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

All of our parameters are either character or integer vectors.
Let’s take a look at the summary:

summary(data)
       Id           MSSubClass      MSZoning          LotFrontage        LotArea          Street         
 Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00   Min.   :  1300   Length:1460       
 1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00   1st Qu.:  7554   Class :character  
 Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00   Median :  9478   Mode  :character  
 Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05   Mean   : 10517                     
 3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00   3rd Qu.: 11602                     
 Max.   :1460.0   Max.   :190.0                      Max.   :313.00   Max.   :215245                     
                                                     NA's   :259                                         
    Alley             LotShape         LandContour         Utilities          LotConfig          LandSlope        
 Length:1460        Length:1460        Length:1460        Length:1460        Length:1460        Length:1460       
 Class :character   Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                                                  
                                                                                                                  
                                                                                                                  
                                                                                                                  
 Neighborhood        Condition1         Condition2          BldgType          HouseStyle         OverallQual    
 Length:1460        Length:1460        Length:1460        Length:1460        Length:1460        Min.   : 1.000  
 Class :character   Class :character   Class :character   Class :character   Class :character   1st Qu.: 5.000  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
                                                                                                Mean   : 6.099  
                                                                                                3rd Qu.: 7.000  
                                                                                                Max.   :10.000  
                                                                                                                
  OverallCond      YearBuilt     YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
 Min.   :1.000   Min.   :1872   Min.   :1950   Length:1460        Length:1460        Length:1460       
 1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967   Class :character   Class :character   Class :character  
 Median :5.000   Median :1973   Median :1994   Mode  :character   Mode  :character   Mode  :character  
 Mean   :5.575   Mean   :1971   Mean   :1985                                                           
 3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004                                                           
 Max.   :9.000   Max.   :2010   Max.   :2010                                                           
                                                                                                       
 Exterior2nd         MasVnrType          MasVnrArea      ExterQual          ExterCond          Foundation       
 Length:1460        Length:1460        Min.   :   0.0   Length:1460        Length:1460        Length:1460       
 Class :character   Class :character   1st Qu.:   0.0   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Median :   0.0   Mode  :character   Mode  :character   Mode  :character  
                                       Mean   : 103.7                                                           
                                       3rd Qu.: 166.0                                                           
                                       Max.   :1600.0                                                           
                                       NA's   :8                                                                
   BsmtQual           BsmtCond         BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
 Length:1460        Length:1460        Length:1460        Length:1460        Min.   :   0.0   Length:1460       
 Class :character   Class :character   Class :character   Class :character   1st Qu.:   0.0   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
                                                                             Mean   : 443.6                     
                                                                             3rd Qu.: 712.2                     
                                                                             Max.   :5644.0                     
                                                                                                                
   BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating           HeatingQC          CentralAir       
 Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460        Length:1460        Length:1460       
 1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character   Class :character   Class :character  
 Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character   Mode  :character   Mode  :character  
 Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                                                           
 3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                                                           
 Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                                                           
                                                                                                             
  Electrical          X1stFlrSF      X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
 Length:1460        Min.   : 334   Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
 Class :character   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
 Mode  :character   Median :1087   Median :   0   Median :  0.000   Median :1464   Median :0.0000  
                    Mean   :1163   Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
                    3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
                    Max.   :4692   Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
                                                                                                   
  BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr    KitchenAbvGr   KitchenQual       
 Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.000   Length:1460       
 1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Class :character  
 Median :0.00000   Median :2.000   Median :0.0000   Median :3.000   Median :1.000   Mode  :character  
 Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866   Mean   :1.047                     
 3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000                     
 Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000   Max.   :3.000                     
                                                                                                      
  TotRmsAbvGrd     Functional          Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
 Min.   : 2.000   Length:1460        Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
 1st Qu.: 5.000   Class :character   1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
 Median : 6.000   Mode  :character   Median :1.000   Mode  :character   Mode  :character   Median :1980  
 Mean   : 6.518                      Mean   :0.613                                         Mean   :1979  
 3rd Qu.: 7.000                      3rd Qu.:1.000                                         3rd Qu.:2002  
 Max.   :14.000                      Max.   :3.000                                         Max.   :2010  
                                                                                           NA's   :81    
 GarageFinish         GarageCars      GarageArea      GarageQual         GarageCond         PavedDrive       
 Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460        Length:1460        Length:1460       
 Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character   Class :character   Class :character  
 Mode  :character   Median :2.000   Median : 480.0   Mode  :character   Mode  :character   Mode  :character  
                    Mean   :1.767   Mean   : 473.0                                                           
                    3rd Qu.:2.000   3rd Qu.: 576.0                                                           
                    Max.   :4.000   Max.   :1418.0                                                           
                                                                                                             
   WoodDeckSF      OpenPorchSF     EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
 1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
 Median :  0.00   Median : 25.00   Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
 Mean   : 94.24   Mean   : 46.66   Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
 3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
 Max.   :857.00   Max.   :547.00   Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
                                                                                                       
    PoolQC             Fence           MiscFeature           MiscVal             MoSold           YrSold    
 Length:1460        Length:1460        Length:1460        Min.   :    0.00   Min.   : 1.000   Min.   :2006  
 Class :character   Class :character   Class :character   1st Qu.:    0.00   1st Qu.: 5.000   1st Qu.:2007  
 Mode  :character   Mode  :character   Mode  :character   Median :    0.00   Median : 6.000   Median :2008  
                                                          Mean   :   43.49   Mean   : 6.322   Mean   :2008  
                                                          3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009  
                                                          Max.   :15500.00   Max.   :12.000   Max.   :2010  
                                                                                                            
   SaleType         SaleCondition        SalePrice     
 Length:1460        Length:1460        Min.   : 34900  
 Class :character   Class :character   1st Qu.:129975  
 Mode  :character   Mode  :character   Median :163000  
                                       Mean   :180921  
                                       3rd Qu.:214000  
                                       Max.   :755000  
                                                       

A lot of our attributes are character vectors, which we can’t summarize this way.

One-dimensional examination:

Here we find out more about individual parameters of the data.
Let’s take a visual look:

plot(data$LotArea)

plot(data$LotArea, ylim = c(1000, 20000))

We can see that, in terms of lot area, most of the properties are between 1000 and 20000 square feet, with a few outliers.
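
To put a number on "most of the properties", the share of lots inside the plotted window can be computed directly. This is only a sketch: the `lot_area` vector below holds made-up values, and on the real data it would be replaced with `data$LotArea`.

```r
# Hypothetical lot areas in square feet (illustrative values only).
lot_area <- c(1300, 7554, 9478, 10517, 11602, 14260, 21000, 215245)

# Fraction of lots whose area falls inside the plotted window.
in_window <- lot_area >= 1000 & lot_area <= 20000
share_in_window <- mean(in_window)
print(share_in_window)

# Quantiles give a feel for where the bulk of the values lies.
print(quantile(lot_area, probs = c(0.05, 0.5, 0.95)))
```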
Let’s check out how many houses were built in each year:

hist(data$YearBuilt)

We can also see a tendency towards newly built homes.

plot(table(data$Fireplaces))
grid()

Multidimensional examination:

Now let’s take a look at multiple parameters at the same time:

library(car)
scatterplot(data$YearBuilt, data$SalePrice, regLine = list(col="green"), smooth=list(col.smooth="red", col.spread="black"))

We might see some kind of exponential pattern, given the higher prices of newly built homes.

scatterplot(data$LotArea, data$SalePrice, xlim = c(1000, 20000), regLine = list(col="green"), smooth=list(col.smooth="red", col.spread="black"))

We can’t see any clear correlation between the area and the price of a property.
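
A visual impression like this can be backed up with a correlation coefficient. The sketch below only demonstrates the calls: the two vectors are made-up values, so the printed numbers say nothing about the real dataset. On the real data, `data$LotArea` and `data$SalePrice` would take their place.

```r
# Hypothetical lot areas (sq ft) and sale prices (illustrative values only).
lot_area   <- c(6000, 7500, 8450, 9600, 11250, 14260)
sale_price <- c(118000, 129900, 208500, 181500, 223500, 250000)

# Pearson measures linear association; Spearman is rank-based and
# therefore less sensitive to outliers such as very large lots.
pearson  <- cor(lot_area, sale_price, method = "pearson")
spearman <- cor(lot_area, sale_price, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```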

Just out of curiosity, I tried to draw the feature plot of the data (a scatterplot of every pair of columns). To my surprise, with a couple of tweaks and some patience, it actually worked:


# This took a couple of minutes, but it worked. The page is 10000 x 10000
# inches, which is about 64k square meters. The plot is basically unreadable,
# but it shows that there is correlation between a couple of the parameters.

# pdf(file = "/home/peter/test.pdf",
#     width = 10000,
#     height = 10000)

# plot(data)

# dev.off()

We couldn’t really learn anything from it, though: due to its size, it’s unreadable.

Let’s check out the covariance matrix:

# cov(data)

This doesn’t work, because cov() accepts only numeric or logical columns and our data frame still contains character columns. It’s time we cleaned up the data a bit. But before we do that, let’s take a look at the correlations between a few of the numeric parameters.

library(corrplot)
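# Same columns as indices 5, 19, 20, 39, 47, 63, 72, 77 and 81, selected by name.
cor_cols <- c("LotArea", "OverallCond", "YearBuilt", "TotalBsmtSF", "GrLivArea",
              "GarageArea", "PoolArea", "MoSold", "SalePrice")
cors <- cor(data[, cor_cols], use = "complete.obs")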
corrplot(cors, type = "lower")

We can see correlation between the price and a couple of parameters, such as the size of the living area.
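
For the covariance matrix that failed earlier, one option is to restrict the data frame to its numeric columns first. Here is a minimal sketch, assuming a small made-up data frame `toy` in place of the real data:

```r
# Toy stand-in for the training data (hypothetical values).
toy <- data.frame(
  LotArea   = c(8450, 9600, 11250, 9550),
  GrLivArea = c(1710, 1262, 1786, 1717),
  Street    = c("Pave", "Pave", "Grvl", "Pave"),
  SalePrice = c(208500, 181500, 223500, 140000)
)

# Keep only the numeric columns; vapply returns one TRUE/FALSE per column.
is_num <- vapply(toy, is.numeric, logical(1))
numeric_only <- toy[, is_num, drop = FALSE]

# use = "complete.obs" drops rows that contain an NA in any kept column.
cov_matrix <- cov(numeric_only, use = "complete.obs")
print(dim(cov_matrix))
```

The same selection should work for `cor()`, which avoids having to pick columns by hand.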

Data Cleaning:

There is a lot to do. We have NAs and non-numeric values everywhere. Let’s start by dealing with the NAs.
We need to find out which columns contain any NAs:

na_cols <- names(which(colSums(is.na(data)) > 0))
na_cols
 [1] "LotFrontage"  "Alley"        "MasVnrType"   "MasVnrArea"   "BsmtQual"     "BsmtCond"     "BsmtExposure"
 [8] "BsmtFinType1" "BsmtFinType2" "Electrical"   "FireplaceQu"  "GarageType"   "GarageYrBlt"  "GarageFinish"
[15] "GarageQual"   "GarageCond"   "PoolQC"       "Fence"        "MiscFeature" 

We have 19 columns containing NAs.
Let’s find out how many NAs each of them has:

get_na_count <- function(column_name) {
    sum(is.na(data[column_name]))
}

na_counts <- data.frame(sapply(na_cols, get_na_count))

library(data.table)
na_stats <- transpose(na_counts)

colnames(na_stats) <- na_cols
rownames(na_stats) <- c("NA count")

calc_na_percentage <- function(column_name) {
    get_na_count(column_name = column_name)/nrow(data) * 100
}

na_stats[nrow(na_stats) + 1,] = sapply(na_cols, calc_na_percentage)
rownames(na_stats) <- c("NA count", "NA percentage")

na_stats

As we can see, we have some parameters that are mostly NAs, while others only contain a few of them.
Let’s deal with them appropriately, now that we know more about them.
First, the LotFrontage parameter:

unique(data$LotFrontage)
  [1]  65  80  68  60  84  85  75  NA  51  50  70  91  72  66 101  57  44 110  98  47 108 112  74 115  61  48  33
 [28]  52 100  24  89  63  76  81  95  69  21  32  78 121 122  40 105  73  77  64  94  34  90  55  88  82  71 120
 [55] 107  92 134  62  86 141  97  54  41  79 174  99  67  83  43 103  93  30 129 140  35  37 118  87 116 150 111
 [82]  49  96  59  36  56 102  58  38 109 130  53 137  45 106 104  42  39 144 114 128 149 313 168 182 138 160 152
[109] 124 153  46

The description doesn’t say anything about NAs in this parameter, but as we can see, there aren’t any zeros among its values. So I’ll assume that NA means zero here, in line with how NAs mark an absent feature in most of the other parameters.
Let’s fill them in now:

# Index the column directly; this also works when no NAs are left (e.g. on a re-run).
data$LotFrontage[is.na(data$LotFrontage)] <- 0

The Alley parameter:
The data_description.txt says that NAs in this parameter mean that there is no alley access to the given property.

unique(data$Alley)
[1] NA     "Grvl" "Pave"

Later I’ll probably convert all character vectors to factors, so let’s leave this as is.
Now for the Masonry veneer type:

ms_types <- unique(data$MasVnrType)
ms_types
[1] "BrkFace" "None"    "Stone"   "BrkCmn"  NA       

We have a handful of NAs, but here they don’t simply mean that the property lacks the feature (that case has its own value, None). We have to actually fill them in.
Let’s do so with the most frequent value:

get_ms_count <- function(unique_value){
    sum(data$MasVnrType == unique_value, na.rm = TRUE)
}

sapply(ms_types, get_ms_count)
BrkFace    None   Stone  BrkCmn    <NA> 
    445     864     128      15       0 

As we can see, the most common option is None, so let’s assume that NAs are None:

data$MasVnrType[is.na(data$MasVnrType)] <- "None"

We have to do the same for Masonry veneer area as well, but with 0s this time:

data$MasVnrArea[is.na(data$MasVnrArea)] <- 0

BsmtQual is next:

unique(data$BsmtQual)
[1] "Gd" "TA" "Ex" NA   "Fa"

According to the description, NAs here mean that the property has no basement. Let’s leave this as is.
The same is true for BsmtCond, BsmtExposure, BsmtFinType1 and BsmtFinType2.
Electrical is up next:

elec_types <- unique(data$Electrical)
elec_types
[1] "SBrkr" "FuseF" "FuseA" "FuseP" "Mix"   NA     

The description doesn’t say anything about the one missing value, so let’s fill it with the most frequent value:


# TODO I need to change these to reusable methods.

get_elec_count <- function(unique_value){
    sum(data$Electrical == unique_value, na.rm = TRUE)
}

sapply(elec_types, get_elec_count)
SBrkr FuseF FuseA FuseP   Mix  <NA> 
 1334    27    94     3     1     0 

As we can see, the Standard Breaker is the most common, so let’s assume the missing value is that:

data$Electrical[is.na(data$Electrical)] <- "SBrkr"
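
The counting and filling steps used for MasVnrType and Electrical follow the same pattern, so they could be folded into a reusable helper. The sketch below is one possible shape for it. The function names `most_frequent` and `fill_with_mode` are made up for this example, and it is meant for character vectors only.

```r
# Return the most frequent non-NA value of a vector.
# Ties are resolved by taking the first maximum in table() order.
most_frequent <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Replace the NAs of a character vector with its most frequent value.
fill_with_mode <- function(x) {
  x[is.na(x)] <- most_frequent(x)
  x
}

# Hypothetical example values (not taken from train.csv).
electrical <- c("SBrkr", "FuseA", "SBrkr", NA, "FuseF", "SBrkr")
electrical <- fill_with_mode(electrical)
print(electrical)
```

With the real data this would be used as `data$Electrical <- fill_with_mode(data$Electrical)`.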

FireplaceQu is next:
The description says that NAs here mean that there is no fireplace, so let’s leave this as is.
The same goes for all the parameters describing the garage.
PoolQC and Fence also behave the exact same way.
Finally the last one, MiscFeature. This one is similar, NAs simply mean that there aren’t any misc features.

Finally, after all this hard work, we shouldn’t have any NAs left in our data frame in places where they don’t make sense. Let’s check whether that’s true:

names(which(colSums(is.na(data)) > 0))
 [1] "Alley"        "BsmtQual"     "BsmtCond"     "BsmtExposure" "BsmtFinType1" "BsmtFinType2" "FireplaceQu" 
 [8] "GarageType"   "GarageYrBlt"  "GarageFinish" "GarageQual"   "GarageCond"   "PoolQC"       "Fence"       
[15] "MiscFeature" 

It is!
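
This check can also be done programmatically instead of by eye. A minimal sketch, assuming a made-up data frame `toy` in place of the cleaned data; the list of expected columns is copied from the output above:

```r
# Columns where NA is a legitimate "feature absent" marker
# (copied from the remaining-NA output above).
expected_na_cols <- c(
  "Alley", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1",
  "BsmtFinType2", "FireplaceQu", "GarageType", "GarageYrBlt",
  "GarageFinish", "GarageQual", "GarageCond", "PoolQC", "Fence",
  "MiscFeature"
)

# Toy stand-in for the cleaned data (hypothetical values).
toy <- data.frame(
  Alley       = c(NA, "Grvl", NA),
  LotFrontage = c(65, 0, 68),
  PoolQC      = c(NA, NA, "Gd")
)

# Any column that still has NAs but is not on the expected list is a problem.
remaining  <- names(which(colSums(is.na(toy)) > 0))
unexpected <- setdiff(remaining, expected_na_cols)
stopifnot(length(unexpected) == 0)
print(remaining)
```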

After we’ve dealt with all of the NAs, let’s check whether everything is the correct type:

str(data)
'data.frame':   1460 obs. of  81 variables:
 $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
 $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
 $ LotFrontage  : num  65 80 68 60 84 85 75 0 51 50 ...
 $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
 $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
 $ Alley        : chr  NA NA NA NA ...
 $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
 $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
 $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
 $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
 $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
 $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
 $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
 $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
 $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
 $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
 $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
 $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
 $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
 $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
 $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
 $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
 $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
 $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
 $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
 $ MasVnrArea   : num  196 0 162 0 350 0 186 240 0 0 ...
 $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
 $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
 $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
 $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
 $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
 $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
 $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
 $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
 $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
 $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
 $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
 $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
 $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
 $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
 $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
 $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
 $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
 $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
 $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
 $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
 $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
 $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
 $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
 $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
 $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
 $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
 $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
 $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
 $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
 $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
 $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
 $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
 $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
 $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
 $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
 $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
 $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
 $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
 $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
 $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
 $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
 $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
 $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolQC       : chr  NA NA NA NA ...
 $ Fence        : chr  NA NA NA NA ...
 $ MiscFeature  : chr  NA NA NA NA ...
 $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
 $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
 $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
 $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
 $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
 $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

Nothing seems out of order, but we still have a bunch of character vectors. We need to encode them in a way that our models can use, so let’s convert them to factors. This way R can automatically dummy-code them when building models.

First we have to find out which parameters are strings, so we know which ones to convert to factors:

char_parms <- colnames(data[sapply(data, is.character)])
char_parms
 [1] "MSZoning"      "Street"        "Alley"         "LotShape"      "LandContour"   "Utilities"     "LotConfig"    
 [8] "LandSlope"     "Neighborhood"  "Condition1"    "Condition2"    "BldgType"      "HouseStyle"    "RoofStyle"    
[15] "RoofMatl"      "Exterior1st"   "Exterior2nd"   "MasVnrType"    "ExterQual"     "ExterCond"     "Foundation"   
[22] "BsmtQual"      "BsmtCond"      "BsmtExposure"  "BsmtFinType1"  "BsmtFinType2"  "Heating"       "HeatingQC"    
[29] "CentralAir"    "Electrical"    "KitchenQual"   "Functional"    "FireplaceQu"   "GarageType"    "GarageFinish" 
[36] "GarageQual"    "GarageCond"    "PavedDrive"    "PoolQC"        "Fence"         "MiscFeature"   "SaleType"     
[43] "SaleCondition"

As we can see, 43 parameters are character vectors. Let’s convert them to factors:

data[char_parms] <- lapply(data[char_parms], factor)
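
To make the dummy coding concrete, here is a minimal sketch of what R does with factors inside a model formula. The toy data frame and its values are made up for illustration and are not taken from train.csv; only the column names mirror the data set.

```r
# Toy data frame (hypothetical values, not rows from train.csv)
toy <- data.frame(
  SalePrice  = c(208500, 181500, 223500, 140000),
  CentralAir = factor(c("Y", "Y", "N", "Y")),
  ExterQual  = factor(c("Gd", "TA", "Gd", "Ex"))
)

# model.matrix() builds the same design matrix lm() would use for this formula
design <- model.matrix(SalePrice ~ CentralAir + ExterQual, data = toy)

# Each factor with k levels becomes k - 1 indicator columns;
# the first level (alphabetically) is the reference and gets no column
colnames(design)
```

Note that lm() silently drops rows where a factor is NA, so columns such as Alley or PoolQC, where NA means "not present", would likely need an explicit level before they can be used this way.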

Let’s check whether we were successful:

str(data)
'data.frame':   1460 obs. of  81 variables:
 $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
 $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
 $ LotFrontage  : num  65 80 68 60 84 85 75 0 51 50 ...
 $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
 $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alley        : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
 $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
 $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
 $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
 $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
 $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
 $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
 $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
 $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
 $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
 $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
 $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
 $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
 $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
 $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
 $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
 $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
 $ MasVnrArea   : num  196 0 162 0 350 0 186 240 0 0 ...
 $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
 $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
 $ BsmtQual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
 $ BsmtCond     : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
 $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
 $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
 $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
 $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
 $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
 $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
 $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
 $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
 $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
 $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
 $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
 $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
 $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
 $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
 $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
 $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
 $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
 $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
 $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
 $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
 $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
 $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
 $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
 $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
 $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
 $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
 $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
 $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
 $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
 $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
 $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
 $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
 $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
 $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
 $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
 $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolQC       : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
 $ Fence        : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
 $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
 $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
 $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
 $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
 $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
 $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

We were!

Now that the data is cleaned, let’s check the correlations to see which parameters we should pay more attention to.
First let’s look at the numeric values:

num_parms <- colnames(data[sapply(data, is.numeric)])
num_parms
 [1] "Id"            "MSSubClass"    "LotFrontage"   "LotArea"       "OverallQual"   "OverallCond"   "YearBuilt"    
 [8] "YearRemodAdd"  "MasVnrArea"    "BsmtFinSF1"    "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"   "X1stFlrSF"    
[15] "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"     "BsmtFullBath"  "BsmtHalfBath"  "FullBath"      "HalfBath"     
[22] "BedroomAbvGr"  "KitchenAbvGr"  "TotRmsAbvGrd"  "Fireplaces"    "GarageYrBlt"   "GarageCars"    "GarageArea"   
[29] "WoodDeckSF"    "OpenPorchSF"   "EnclosedPorch" "X3SsnPorch"    "ScreenPorch"   "PoolArea"      "MiscVal"      
[36] "MoSold"        "YrSold"        "SalePrice"    
numcors <- cor(data[,num_parms], use = "complete.obs")
corrplot(numcors, type = "lower")

This is hard to read, but we can already see that we don’t need all of these parameters.
Let’s check out the more relevant ones, those whose correlation with SalePrice is above 0.5:

relevant_names <- names(which(numcors["SalePrice", ] > 0.5))

relcors <- cor(data[,relevant_names], use = "complete.obs")
corrplot(relcors, type = "lower")
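
A threshold of `> 0.5` only keeps strong positive correlations; a parameter with a correlation of -0.8 would be discarded even though it is just as informative. The sketch below shows one way to filter on the absolute value instead. It uses the built-in mtcars data set as a stand-in, with mpg playing the role of SalePrice; whether any strongly negative correlations exist in the housing data is not checked here.

```r
# mtcars ships with R; "mpg" stands in for the target column
cors <- cor(mtcars, use = "complete.obs")

# Correlations of every other column with the target, selected by name
target_cors <- cors["mpg", colnames(cors) != "mpg"]

# Keep columns whose correlation is strong in either direction
strong <- names(target_cors[abs(target_cors) > 0.5])

# Show them ordered from most positive to most negative
sort(target_cors[strong], decreasing = TRUE)
```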

Let’s check them out:

scatterplot(data$OverallQual, data$SalePrice, regLine = list(col="green"), smooth=list(col.smooth="red", col.spread="black"))


scatterplot(data$GrLivArea, data$SalePrice, xlim = c(250, 3000), ylim = c(0, 500000), regLine = list(col="green"), smooth=list(col.smooth="red", col.spread="black"))


scatterplot(data$GrLivArea, data$TotRmsAbvGrd, regLine = list(col="green"), smooth=list(col.smooth="red", col.spread="black"))


scatterplot(data$OverallQual, data$GrLivArea, regLine = list(col="green"), smooth=list(col.smooth="red", col.spread="black"))


pairs(data[,c("SalePrice", "GrLivArea", "TotRmsAbvGrd")])

We can see a couple of obvious correlations that carry little new information, because both parameters measure almost the same thing: GarageArea and GarageCars, and GrLivArea and TotRmsAbvGrd.

Let’s calculate a new parameter, the price per square foot:

data$ppsqf <- data$SalePrice / data$GrLivArea
scatterplot(data$OverallQual, data$ppsqf, regLine = list(col="green"), smooth=list(col.smooth="red", col.spread="black"))
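
One caveat about a parameter like ppsqf: it is computed from SalePrice, the value we want to predict, so it is fine for exploration but problematic as a model input. The sketch below uses simulated data (all names and numbers are made up) to show how a predictor derived from the target can only raise the R-squared, without the model having learned anything usable for houses with an unknown price.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 200
area  <- runif(n, 500, 3000)     # simulated living area
price <- 100 * area + rnorm(n, sd = 20000)   # simulated sale price

# A feature derived from the target, analogous to ppsqf
leak <- price / area

fit_clean <- lm(price ~ area)          # uses only information known in advance
fit_leak  <- lm(price ~ area + leak)   # also uses the target-derived feature

r2_clean <- summary(fit_clean)$r.squared
r2_leak  <- summary(fit_leak)$r.squared
c(clean = r2_clean, leak = r2_leak)
```

Since test.csv has no SalePrice column, a feature like this cannot be computed there at all.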

Let’s check out the correlation between this new parameter and the old ones:

relevant_names2 <- c(relevant_names, "ppsqf")
relevant_names2
 [1] "OverallQual"  "YearBuilt"    "YearRemodAdd" "TotalBsmtSF"  "X1stFlrSF"    "GrLivArea"    "FullBath"    
 [8] "TotRmsAbvGrd" "GarageCars"   "GarageArea"   "SalePrice"    "ppsqf"       
relcors2 <- cor(data[,relevant_names2], use = "complete.obs")
corrplot(relcors2, type = "lower")

It seems that newer houses have a higher price per square foot. Let’s find out:

scatterplot(data$YearBuilt, data$ppsqf, regLine = list(col="green"), smooth=list(col.smooth="red", col.spread="black"))

We were right.

Model Building:

Let’s build some models. I would like to use the caret package to build a linear and an exponential regression model.
Let’s create the data partitions first:

library(caret)

target <- data$SalePrice
trainIdx <- createDataPartition(target, p = .75)
traindata <- data[trainIdx$Resample1,]
testdata <- data[-trainIdx$Resample1,]
str(traindata)
'data.frame':   1097 obs. of  82 variables:
 $ Id           : int  1 2 4 5 6 8 9 10 11 13 ...
 $ MSSubClass   : int  60 20 70 60 50 60 50 190 20 20 ...
 $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 5 4 4 4 ...
 $ LotFrontage  : num  65 80 60 84 85 0 51 50 70 0 ...
 $ LotArea      : int  8450 9600 9550 14260 14115 10382 6120 7420 11200 12968 ...
 $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alley        : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
 $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 4 4 2 ...
 $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
 $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 1 3 5 1 5 1 5 5 ...
 $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
 $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 7 14 12 17 18 4 19 19 ...
 $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 5 1 1 3 3 ...
 $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 1 3 3 ...
 $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 2 1 1 ...
 $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 1 6 1 2 3 3 ...
 $ OverallQual  : int  7 6 7 8 5 7 7 5 5 5 ...
 $ OverallCond  : int  5 8 5 5 5 6 5 6 5 6 ...
 $ YearBuilt    : int  2003 1976 1915 2000 1993 1973 1931 1939 1965 1962 ...
 $ YearRemodAdd : int  2003 1976 1970 2000 1995 1973 1950 1950 1965 1962 ...
 $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 4 4 ...
 $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 14 13 13 7 4 9 7 7 ...
 $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 16 14 14 7 16 9 7 11 ...
 $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 3 2 3 4 3 3 3 3 ...
 $ MasVnrArea   : num  196 0 0 350 0 240 0 0 0 0 ...
 $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 4 3 4 4 4 4 4 4 ...
 $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 1 3 6 2 1 1 2 2 ...
 $ BsmtQual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 4 3 3 3 4 4 4 4 ...
 $ BsmtCond     : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 2 4 4 4 4 4 4 4 ...
 $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 4 1 4 3 4 4 4 4 ...
 $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 1 3 3 1 6 3 5 1 ...
 $ BsmtFinSF1   : int  706 978 216 655 732 859 0 851 906 737 ...
 $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 2 6 6 6 6 ...
 $ BsmtFinSF2   : int  0 0 0 0 0 32 0 0 0 0 ...
 $ BsmtUnfSF    : int  150 284 540 490 64 216 952 140 134 175 ...
 $ TotalBsmtSF  : int  856 1262 756 1145 796 1107 952 991 1040 912 ...
 $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 3 1 1 1 3 1 1 5 ...
 $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
 $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 2 5 5 5 ...
 $ X1stFlrSF    : int  856 1262 961 1145 796 1107 1022 1077 1040 912 ...
 $ X2ndFlrSF    : int  854 0 756 1053 566 983 752 0 0 0 ...
 $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GrLivArea    : int  1710 1262 1717 2198 1362 2090 1774 1077 1040 912 ...
 $ BsmtFullBath : int  1 0 1 1 1 1 0 1 1 1 ...
 $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
 $ FullBath     : int  2 2 1 2 1 2 2 1 1 1 ...
 $ HalfBath     : int  1 0 0 1 1 1 0 0 0 0 ...
 $ BedroomAbvGr : int  3 3 3 4 1 3 2 2 3 2 ...
 $ KitchenAbvGr : int  1 1 1 1 1 1 2 2 1 1 ...
 $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 4 4 4 4 4 4 ...
 $ TotRmsAbvGrd : int  8 6 7 9 5 7 8 5 5 4 ...
 $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 3 7 7 7 ...
 $ Fireplaces   : int  0 1 1 1 0 2 2 2 0 0 ...
 $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 3 5 NA 5 5 5 NA NA ...
 $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 6 2 2 2 6 2 6 6 ...
 $ GarageYrBlt  : int  2003 1976 1998 2000 1993 1973 1931 1939 1965 1962 ...
 $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 3 2 3 2 3 2 3 3 ...
 $ GarageCars   : int  2 2 3 3 2 2 2 1 1 1 ...
 $ GarageArea   : int  548 460 642 836 480 484 468 205 384 352 ...
 $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 2 3 5 5 ...
 $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
 $ WoodDeckSF   : int  0 298 0 192 40 235 90 0 0 140 ...
 $ OpenPorchSF  : int  61 0 35 84 30 204 0 4 0 0 ...
 $ EnclosedPorch: int  0 0 272 0 0 228 205 0 0 0 ...
 $ X3SsnPorch   : int  0 0 0 0 320 0 0 0 0 0 ...
 $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 176 ...
 $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolQC       : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
 $ Fence        : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA 3 NA NA NA NA NA ...
 $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA 3 3 NA NA NA NA ...
 $ MiscVal      : int  0 0 0 0 700 350 0 0 0 0 ...
 $ MoSold       : int  2 5 2 12 10 11 4 1 2 9 ...
 $ YrSold       : int  2008 2007 2006 2008 2009 2009 2008 2008 2008 2008 ...
 $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 1 5 5 5 1 5 5 5 ...
 $ SalePrice    : int  208500 181500 140000 250000 143000 200000 129900 118000 129500 144000 ...
 $ ppsqf        : num  121.9 143.8 81.5 113.7 105 ...
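
The partition above is random, so every run of the notebook gives a different split and slightly different model coefficients. A minimal sketch of a reproducible 75/25 split follows. It uses base R's sample() as a stand-in; unlike createDataPartition() it does not balance the split on the target, and the seed value is arbitrary. Calling set.seed() before createDataPartition() should fix that split in the same way.

```r
set.seed(42)   # arbitrary seed; any fixed value makes the split repeatable
n <- 1460      # number of rows in train.csv

# Draw 75% of the row indices for training, keep the rest for testing
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

length(train_idx)
length(test_idx)
```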

After creating the partitions, let’s build the model.
First let’s just use one parameter:

model <- lm(SalePrice~OverallQual, data = traindata)
summary(model)

Call:
lm(formula = SalePrice ~ OverallQual, data = traindata)

Residuals:
    Min      1Q  Median      3Q     Max 
-176057  -29307   -1932   20693  394193 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -97943       6761  -14.49   <2e-16 ***
OverallQual    45875       1085   42.26   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 49240 on 1095 degrees of freedom
Multiple R-squared:   0.62, Adjusted R-squared:  0.6196 
F-statistic:  1786 on 1 and 1095 DF,  p-value: < 2.2e-16
plot(model)

shapiro.test(model$residuals)

    Shapiro-Wilk normality test

data:  model$residuals
W = 0.88868, p-value < 2.2e-16
confint(model)
                 2.5 %    97.5 %
(Intercept) -111209.66 -84676.94
OverallQual   43745.25  48004.80
cor(traindata$SalePrice, model$fitted.values)
[1] 0.7873731
model

Call:
lm(formula = SalePrice ~ OverallQual, data = traindata)

Coefficients:
(Intercept)  OverallQual  
     -97943        45875  
prediction <- predict(model, testdata, type="response")
model_output <- cbind(testdata, prediction)

model_output$log_prediction <- log(model_output$prediction)
NaNs produced
model_output$log_SalePrice <- log(model_output$SalePrice)

rmse <- function(fittedvals, truevals){
  sqrt(mean((fittedvals - truevals)^2))
}

rmse(model_output$log_SalePrice,model_output$log_prediction)
[1] NaN
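
The NaN can be traced back to the coefficients printed above: with a negative intercept, the lowest quality scores get a negative predicted price, and the logarithm of a negative number is undefined. The sketch below reproduces this with the reported coefficients. The log_rmse() helper and its floor_value argument are hypothetical additions, one possible workaround rather than part of the original analysis, and the "true" prices are made-up numbers.

```r
# Coefficients as reported by summary(model) above
intercept <- -97943
slope     <- 45875

quality <- 1:4
pred <- intercept + slope * quality
pred                      # the first two predicted prices are negative

rmse <- function(fittedvals, truevals) {
  sqrt(mean((fittedvals - truevals)^2))
}

# Hypothetical helper: clamp predictions to a small positive floor before
# taking logs, so a few negative predictions do not turn the metric into NaN
log_rmse <- function(pred, truth, floor_value = 1) {
  rmse(log(pmax(pred, floor_value)), log(truth))
}

truth <- c(40000, 60000, 80000, 100000)   # made-up prices for illustration
log_rmse(pred, truth)
```

Clamping keeps the metric finite but heavily penalises the affected rows, so it is better read as a warning sign about the model than as a fix.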

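The model explains about 62% of the variance, but we cannot score it: the log RMSE is NaN. With an intercept of -97943 and a slope of 45875, houses with an OverallQual of 1 or 2 get a negative predicted price, whose logarithm is undefined. Let’s try a different approach, with more parameters: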

model2 <- lm(SalePrice~OverallQual+GrLivArea, data = traindata)
summary(model2)

Call:
lm(formula = SalePrice ~ OverallQual + GrLivArea, data = traindata)

Residuals:
    Min      1Q  Median      3Q     Max 
-312839  -23114    -499   20362  284307 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -108567.53    5883.22  -18.45   <2e-16 ***
OverallQual   32823.44    1162.10   28.25   <2e-16 ***
GrLivArea        59.44       3.11   19.11   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 42650 on 1094 degrees of freedom
Multiple R-squared:  0.7151,    Adjusted R-squared:  0.7146 
F-statistic:  1373 on 2 and 1094 DF,  p-value: < 2.2e-16

Let’s evaluate it:

plot(model2)

shapiro.test(model2$residuals)

    Shapiro-Wilk normality test

data:  model2$residuals
W = 0.90193, p-value < 2.2e-16
confint(model2)
                    2.5 %     97.5 %
(Intercept) -120111.19159 -97023.865
OverallQual   30543.23576  35103.644
GrLivArea        53.33354     65.538
cor(traindata$SalePrice, model2$fitted.values)
[1] 0.8456234
model2

Call:
lm(formula = SalePrice ~ OverallQual + GrLivArea, data = traindata)

Coefficients:
(Intercept)  OverallQual    GrLivArea  
 -108567.53     32823.44        59.44  
prediction2 <- predict(model2, testdata, type="response")
model2_output <- cbind(testdata, prediction2)

model2_output$log_prediction <- log(model2_output$prediction2)
NaNs produced
model2_output$log_SalePrice <- log(model2_output$SalePrice)

rmse(model2_output$log_SalePrice, model2_output$log_prediction)
[1] NaN

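The fit itself improved: the adjusted R-squared rose from 0.62 to 0.71 and the residual standard error fell from 49240 to 42650. The log RMSE is still NaN, though, because this model also predicts negative prices for some houses. Let’s try to give it more parameters: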

model3 <- lm(SalePrice~OverallQual+YearBuilt+YearRemodAdd+TotalBsmtSF+X1stFlrSF+
               GrLivArea+FullBath+TotRmsAbvGrd+GarageCars+GarageArea+ppsqf, data = traindata)
summary(model3)

Call:
lm(formula = SalePrice ~ OverallQual + YearBuilt + YearRemodAdd + 
    TotalBsmtSF + X1stFlrSF + GrLivArea + FullBath + TotRmsAbvGrd + 
    GarageCars + GarageArea + ppsqf, data = traindata)

Residuals:
    Min      1Q  Median      3Q     Max 
-250178   -5733     187    5790  143316 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -5.302e+04  7.824e+04  -0.678  0.49816    
OverallQual   2.007e+03  7.503e+02   2.675  0.00760 ** 
YearBuilt     3.202e+01  3.073e+01   1.042  0.29769    
YearRemodAdd -1.102e+02  3.780e+01  -2.917  0.00361 ** 
TotalBsmtSF   6.385e-01  2.702e+00   0.236  0.81323    
X1stFlrSF     1.725e+00  2.983e+00   0.578  0.56314    
GrLivArea     1.187e+02  2.733e+00  43.417  < 2e-16 ***
FullBath     -5.862e+02  1.581e+03  -0.371  0.71091    
TotRmsAbvGrd -8.881e+01  6.594e+02  -0.135  0.89288    
GarageCars    2.261e+01  1.807e+03   0.013  0.99002    
GarageArea    1.242e+00  6.070e+00   0.205  0.83794    
ppsqf         1.625e+03  3.068e+01  52.969  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19020 on 1085 degrees of freedom
Multiple R-squared:  0.9438,    Adjusted R-squared:  0.9432 
F-statistic:  1656 on 11 and 1085 DF,  p-value: < 2.2e-16

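The R-squared of 0.94 looks good, but it is misleading: ppsqf is computed from SalePrice, so the model is partly being told the answer, and ppsqf cannot be computed for houses whose price is unknown. Let’s try some exponential models instead, that is, linear models fitted on log(SalePrice).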

model4 <- lm(log(SalePrice) ~ OverallQual, data = traindata)
summary(model4)

Call:
lm(formula = log(SalePrice) ~ OverallQual, data = traindata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.07183 -0.12568  0.01076  0.12451  0.71575 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.584273   0.031367  337.44   <2e-16 ***
OverallQual  0.236949   0.005036   47.05   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2284 on 1095 degrees of freedom
Multiple R-squared:  0.6691,    Adjusted R-squared:  0.6688 
F-statistic:  2214 on 1 and 1095 DF,  p-value: < 2.2e-16
plot(model4)

shapiro.test(model4$residuals)

    Shapiro-Wilk normality test

data:  model4$residuals
W = 0.98069, p-value = 6.697e-11
confint(model4)
                 2.5 %     97.5 %
(Intercept) 10.5227273 10.6458192
OverallQual  0.2270685  0.2468296
cor(traindata$SalePrice, model4$fitted.values)
[1] 0.7873731
model4

Call:
lm(formula = log(SalePrice) ~ OverallQual, data = traindata)

Coefficients:
(Intercept)  OverallQual  
    10.5843       0.2369  
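
A model fitted on log(SalePrice) predicts log prices, so predictions have to be passed through exp() to get prices back, and the result is always positive. The sketch below shows this on simulated data whose coefficients only loosely resemble the ones above; all values are made up.

```r
set.seed(7)   # arbitrary seed, for reproducibility only

# Simulated data: the log of the price grows linearly with quality
toy <- data.frame(OverallQual = sample(1:10, 100, replace = TRUE))
toy$SalePrice <- exp(10.5 + 0.24 * toy$OverallQual + rnorm(100, sd = 0.2))

fit <- lm(log(SalePrice) ~ OverallQual, data = toy)

# predict() returns values on the log scale; exp() converts them to prices
log_pred   <- predict(fit, newdata = toy)
price_pred <- exp(log_pred)

# The log RMSE can be computed directly on the log scale, no NaN possible
log_rmse <- sqrt(mean((log_pred - log(toy$SalePrice))^2))
log_rmse

# A slope b on the log scale means each extra quality point multiplies
# the predicted price by exp(b)
exp(coef(fit)[["OverallQual"]])
```

Read this way, the OverallQual coefficient of 0.2369 reported above corresponds to roughly a 27% higher predicted price per quality point.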

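This model fits somewhat better than the single-parameter linear one: the R-squared is 0.67 (measured on the log scale, so not strictly comparable with the earlier 0.62), and the residuals are closer to normal (W = 0.98 versus 0.89). Let’s try the same thing with more parameters.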

model5 <- lm(log(SalePrice)~OverallQual+YearBuilt+YearRemodAdd+TotalBsmtSF+X1stFlrSF+
               GrLivArea+FullBath+TotRmsAbvGrd+GarageCars+GarageArea+ppsqf, data = traindata)
summary(model5)

Call:
lm(formula = log(SalePrice) ~ OverallQual + YearBuilt + YearRemodAdd + 
    TotalBsmtSF + X1stFlrSF + GrLivArea + FullBath + TotRmsAbvGrd + 
    GarageCars + GarageArea + ppsqf, data = traindata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.09770 -0.03747  0.01641  0.05551  0.17188 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.553e+00  3.835e-01  19.692  < 2e-16 ***
OverallQual   2.396e-02  3.678e-03   6.514 1.12e-10 ***
YearBuilt     8.148e-04  1.506e-04   5.409 7.78e-08 ***
YearRemodAdd  5.140e-04  1.853e-04   2.774 0.005630 ** 
TotalBsmtSF   1.249e-05  1.324e-05   0.943 0.345966    
X1stFlrSF    -4.762e-06  1.462e-05  -0.326 0.744755    
GrLivArea     4.812e-04  1.340e-05  35.909  < 2e-16 ***
FullBath      5.133e-03  7.751e-03   0.662 0.507918    
TotRmsAbvGrd  1.210e-02  3.232e-03   3.742 0.000192 ***
GarageCars    3.333e-02  8.856e-03   3.764 0.000176 ***
GarageArea   -3.522e-05  2.975e-05  -1.184 0.236740    
ppsqf         6.915e-03  1.504e-04  45.979  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09325 on 1085 degrees of freedom
Multiple R-squared:  0.9454,    Adjusted R-squared:  0.9448 
F-statistic:  1706 on 11 and 1085 DF,  p-value: < 2.2e-16
plot(model5)

shapiro.test(model5$residuals)

    Shapiro-Wilk normality test

data:  model5$residuals
W = 0.82302, p-value < 2.2e-16
confint(model5)
                     2.5 %       97.5 %
(Intercept)   6.800319e+00 8.305464e+00
OverallQual   1.674089e-02 3.117387e-02
YearBuilt     5.192312e-04 1.110337e-03
YearRemodAdd  1.504410e-04 8.775254e-04
TotalBsmtSF  -1.349985e-05 3.847460e-05
X1stFlrSF    -3.345656e-05 2.393213e-05
GrLivArea     4.548603e-04 5.074428e-04
FullBath     -1.007478e-02 2.034143e-02
TotRmsAbvGrd  5.754194e-03 1.843926e-02
GarageCars    1.595416e-02 5.070691e-02
GarageArea   -9.360557e-05 2.315829e-05
ppsqf         6.620080e-03 7.210296e-03
cor(traindata$SalePrice, model5$fitted.values)
[1] 0.9653806
model5

Call:
lm(formula = log(SalePrice) ~ OverallQual + YearBuilt + YearRemodAdd + 
    TotalBsmtSF + X1stFlrSF + GrLivArea + FullBath + TotRmsAbvGrd + 
    GarageCars + GarageArea + ppsqf, data = traindata)

Coefficients:
 (Intercept)   OverallQual     YearBuilt  YearRemodAdd   TotalBsmtSF     X1stFlrSF     GrLivArea      FullBath  
   7.553e+00     2.396e-02     8.148e-04     5.140e-04     1.249e-05    -4.762e-06     4.812e-04     5.133e-03  
TotRmsAbvGrd    GarageCars    GarageArea         ppsqf  
   1.210e-02     3.333e-02    -3.522e-05     6.915e-03  

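The training fit looks far better (adjusted R-squared of 0.94 versus 0.67), but the model did not really improve. Much of the gain appears to come from ppsqf, which is derived from SalePrice and is therefore unavailable for houses with an unknown price, and the residuals are further from normal (W = 0.82 versus 0.98).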

I would love to continue working on this exercise, improving my models, examining the target parameter in relation to groups of objects, such as city areas, or types of buildings, but sadly I’m out of time.

As a last thing, I tried to check whether the prices and the sizes of living areas are affected by the neighborhood. We can only see some correlation in the outliers.

plot(data$GrLivArea, data$SalePrice, col=data$Neighborhood)
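
With 25 neighborhoods sharing one scatterplot, the colors are hard to tell apart. A summary per group may be easier to read. The sketch below computes the median price per neighborhood with tapply() on a tiny made-up data frame; the prices are hypothetical and the neighborhood codes are used only as labels.

```r
# Hypothetical rows for illustration, not taken from train.csv
toy <- data.frame(
  Neighborhood = factor(c("CollgCr", "CollgCr", "OldTown",
                          "OldTown", "NridgHt", "NridgHt")),
  SalePrice    = c(208500, 223500, 129900, 118000, 345000, 279500)
)

# Median sale price for each neighborhood, sorted from cheapest to priciest
med <- tapply(toy$SalePrice, toy$Neighborhood, median)
sort(med)
```

On the real data the same call would be `tapply(data$SalePrice, data$Neighborhood, median)`, and `boxplot(SalePrice ~ Neighborhood, data = data)` would show the spread as well.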

In conclusion, I still have a lot to learn and would need a lot more time to properly solve the problem. My current best model only uses a single parameter. If I have the time for it in the future, I’ll return to try to solve it properly.

dW1uX25hbWUpIHsKICAgIHN1bShpcy5uYShkYXRhW2NvbHVtbl9uYW1lXSkpCn0KCm5hX2NvdW50cyA8LSBkYXRhLmZyYW1lKHNhcHBseShuYV9jb2xzLCBnZXRfbmFfY291bnQpKQoKbGlicmFyeShkYXRhLnRhYmxlKQpuYV9zdGF0cyA8LSB0cmFuc3Bvc2UobmFfY291bnRzKQoKY29sbmFtZXMobmFfc3RhdHMpIDwtIG5hX2NvbHMKcm93bmFtZXMobmFfc3RhdHMpIDwtIGMoIk5BIGNvdW50IikKCmNhbGNfbmFfcGVyY2VudGFnZSA8LSBmdW5jdGlvbihjb2x1bW5fbmFtZSkgewogICAgZ2V0X25hX2NvdW50KGNvbHVtbl9uYW1lID0gY29sdW1uX25hbWUpL25yb3coZGF0YSkgKiAxMDAKfQoKbmFfc3RhdHNbbnJvdyhuYV9zdGF0cykgKyAxLF0gPSBzYXBwbHkobmFfY29scywgY2FsY19uYV9wZXJjZW50YWdlKQpyb3duYW1lcyhuYV9zdGF0cykgPC0gYygiTkEgY291bnQiLCAiTkEgcGVyY2VudGFnZSIpCgpuYV9zdGF0cwpgYGAKCkFzIHdlIGNhbiBzZWUsIHdlIGhhdmUgc29tZSBwYXJhbWV0ZXJzIHRoYXQgYXJlIG1vc3RseSBOQXMsIHdoaWxlIG90aGVycyBvbmx5IApjb250YWluIGEgZmV3IG9mIHRoZW0uICAKTGV0J3MgZGVhbCB3aXRoIHRoZW0gYXBwcm9wcmlhdGVseSwgbm93IHRoYXQgd2Uga25vdyBtb3JlIGFib3V0IHRoZW0uICAKRmlyc3QsIHRoZSBMb3RGcm9udGFnZSBwYXJhbWV0ZXI6CmBgYHtyfQp1bmlxdWUoZGF0YSRMb3RGcm9udGFnZSkKYGBgCgoKVGhlIGRlc2NyaXB0aW9uIGRvZXNuJ3Qgc2F5IGFueXRoaW5nIGFib3V0IE5BcyBpbiB0aGlzIHBhcmFtZXRlciwgYnV0IGFzIHdlIGNhbgpzZWUsIHRoZXJlIGFyZW4ndCBhbnkgemVyb3MgaGVyZS4gU28gSSdsbCBhc3N1bWUgdGhhdCBOQXMgbWVhbiB6ZXJvIGhlcmUgYXMgd2VsbCwKYXMgaXQgZG9lcyBpbiBtb3N0IG9mIHRoZSBvdGhlciBwYXJhbWV0ZXJzLiAgCkxldCdzIGZpbGwgdGhlbSBpbiBub3c6CmBgYHtyfQpkYXRhW2lzLm5hKGRhdGEkTG90RnJvbnRhZ2UpLF0kTG90RnJvbnRhZ2UgPC0gMApgYGAKCgpUaGUgQWxsZXkgcGFyYW1ldGVyOiAgClRoZSBkYXRhX2Rlc2NyaXB0aW9uLnR4dCBzYXlzIHRoYXQgTkFzIGluIHRoaXMgcGFyYW1ldGVyIG1lYW4sIHRoYXQgdGhlcmUgaXMgbm8KYWxsZXkgYWNjZXNzLCB0byB0aGUgZ2l2ZW4gcHJvcGVydHkuCmBgYHtyfQp1bmlxdWUoZGF0YSRBbGxleSkKYGBgCgpMYXRlciBJJ2xsIHByb2JhYmx5IGNvbnZlcnQgYWxsIGNoYXJhY3Rlcgp2ZWN0b3JzIHRvIGZhY3RvcnMsIHNvIGxldCdzIGxlYXZlIHRoaXMgYXMgaXMuICAKTm93IGZvciB0aGUgTWFzb25yeSB2ZW5lZXIgdHlwZToKYGBge3J9Cm1zX3R5cGVzIDwtIHVuaXF1ZShkYXRhJE1hc1ZuclR5cGUpCm1zX3R5cGVzCmBgYAoKV2UgaGF2ZSBhIGhhbmRmdWwgb2YgTkFzLCBidXQgaGVyZSB0aGV5IGRvIG5vdCBzaW1wbHkgbWVhbiB0aGF0IHRoZXJlIGlzIG5vIHN1Y2gKdGhpbmcgYXMgd2hhdCdzIGJlaW5nIGRlc2NyaWJlZCBieSB0aGUgcGFyYW1ldGVyLiBXZSBo
YXZlIHRvIGFjdHVhbGx5IGZpbGwgdGhlbQppbi4gIApMZXQncyBkbyBzbyBieSB0aGUgbW9zdCBmcmVxdWVudCB2YWx1ZToKCmBgYHtyfQpnZXRfbXNfY291bnQgPC0gZnVuY3Rpb24odW5pcXVlX3ZhbHVlKXsKICAgIHN1bShkYXRhJE1hc1ZuclR5cGUgPT0gdW5pcXVlX3ZhbHVlLCBuYS5ybSA9IFQpCn0KCnNhcHBseShtc190eXBlcywgZ2V0X21zX2NvdW50KQpgYGAKCkFzIHdlIGNhbiBzZWUsIHRoZSBtb3N0IGNvbW1vbiBvcHRpb24gaXMgTm9uZSwgc28gbGV0J3MgYXNzdW1lIHRoYXQgTkFzIGFyZSBOb25lOgpgYGB7cn0KZGF0YVtpcy5uYShkYXRhJE1hc1ZuclR5cGUpLF0kTWFzVm5yVHlwZSA8LSAiTm9uZSIKYGBgCgpXZSBoYXZlIHRvIGRvIHRoZSBzYW1lIGZvciBNYXNvbnJ5IHZlbmVlciBhcmVhIGFzIHdlbGwsIGJ1dCB3aXRoIDBzIHRoaXMgdGltZToKYGBge3J9CmRhdGFbaXMubmEoZGF0YSRNYXNWbnJBcmVhKSxdJE1hc1ZuckFyZWEgPC0gMApgYGAKCkJzbXRRdWFsIGlzIG5leHQ6IApgYGB7cn0KdW5pcXVlKGRhdGEkQnNtdFF1YWwpCmBgYAoKQWNjb3JkaW5nIHRvIHRoZSBkZXNjcmlwdGlvbiwgTkFzIGhlcmUgbWVhbiwgdGhhdCB0aGUgcHJvcGVydHkgaGFzIG5vIGJhc2VtZW50LgpMZXQncyBsZWF2ZSB0aGlzIGFzIGlzLiAgClRoZSBzYW1lIGlzIHRydWUgZm9yIEJzbXRDb25kLCBCc210RXhwb3N1cmUsIEJzbXRGaW5TRjEsIEJzbXRGaW5UeXBlMSBhbmQgQnNtdEZpblR5cGUyLiAgCkVsZWN0cmljYWwgaXMgdXAgbmV4dDoKYGBge3J9CmVsZWNfdHlwZXMgPC0gdW5pcXVlKGRhdGEkRWxlY3RyaWNhbCkKZWxlY190eXBlcwpgYGAKClRoZSBkZXNjcmlwdGlvbiBkb2Vzbid0IHNheSBhbnl0aGluZyBhYm91dCB0aGUgb25lIG1pc3NpbmcgdmFsdWUsIHNvIGxldCdzIGZpbGwKaXQgd2l0aCB0aGUgbW9zdCBmcmVxdWVudCB2YWx1ZToKYGBge3J9CgojIFRPRE8gSSBuZWVkIHRvIGNoYW5nZSB0aGVzZSB0byByZXVzYWJsZSBtZXRob2RzLgoKZ2V0X2VsZWNfY291bnQgPC0gZnVuY3Rpb24odW5pcXVlX3ZhbHVlKXsKICAgIHN1bShkYXRhJEVsZWN0cmljYWwgPT0gdW5pcXVlX3ZhbHVlLCBuYS5ybSA9IFQpCn0KCnNhcHBseShlbGVjX3R5cGVzLCBnZXRfZWxlY19jb3VudCkKYGBgCgpBcyB3ZSBjYW4gc2VlLCB0aGUgU3RhbmRhcmQgQnJlYWtlciBpcyB0aGUgbW9zdCBjb21tb24sIGxldCdzIGFzc3VtZSB0aGUgbWlzc2luZwp2YWx1ZSBpcyB0aGF0OgpgYGB7cn0KZGF0YVtpcy5uYShkYXRhJEVsZWN0cmljYWwpLF0kRWxlY3RyaWNhbCA8LSAiU0Jya3IiCmBgYAoKRmlyZXBsYWNlUXUgaXMgbmV4dDogIApUaGUgZGVzY3JpcHRpb24gc2F5cyB0aGF0IE5BcyBoZXJlIG1lYW4gdGhhdCB0aGVyZSBpcyBubyBmaXJlcGxhY2UsIHNvIGxldCdzIApsZWF2ZSB0aGlzIGFzIGlzLiAgClRoZSBzYW1lIGRlYWwgZm9yIGFsbCB0aGUgcGFyYW1ldGVycyBkZXNjcmliaW5nIHRoZSBnYXJhZ2VzLiAgClBvb2xRQyBhbmQgRmVuY2UgYWxz
byBiZWhhdmUgdGhlIGV4YWN0IHNhbWUgd2F5LiAgCkZpbmFsbHkgdGhlIGxhc3Qgb25lLCBNaXNjRmVhdHVyZS4gVGhpcyBvbmUgaXMgc2ltaWxhciwgTkFzIHNpbXBseSBtZWFuIHRoYXQKdGhlcmUgYXJlbid0IGFueSBtaXNjIGZlYXR1cmVzLiAgCgpGaW5hbGx5IGFmdGVyIGFsbCB0aGlzIGhhcmQgd29yaywgd2Ugc2hvdWxkbid0IGhhdmUgYW55IE5BcyBsZWZ0IGluIG91cgpkYXRhZnJhbWUsIHdoZXJlIHRoZXkgZG9uJ3QgbWFrZSBhbnkgc2Vuc2UKTGV0J3MgY2hlY2sgd2hldGhlciB0aGF0J3MgdHJ1ZToKYGBge3J9Cm5hbWVzKHdoaWNoKGNvbFN1bXMoaXMubmEoZGF0YSkpID4gMCkpCmBgYAoKSXQgaXMhCgpBZnRlciB3ZSd2ZSBkZWFsdCB3aXRoIGFsbCBvZiB0aGUgTkFzLCBsZXQncyBjaGVjayB3aGV0aGVyIGV2ZXJ5dGhpbmcgaXMgdGhlIApjb3JyZWN0IHR5cGU6CmBgYHtyfQpzdHIoZGF0YSkKYGBgCgpOb3RoaW5nIHNlZW1zIG91dCBvZiBvcmRlciwgYnV0IHdlIHN0aWxsIGhhdmUgYSBidW5jaCBvZiBjaGFyYWN0ZXIgdmVjdG9ycy4KV2UgbmVlZCB0byBlbmNvZGUgdGhlbSBpbiBhIHdheSwgdGhhdCBvdXIgbW9kZWxzIGNhbiB1c2UuIExldCdzIGNvbnZlcnQgdGhlbSB0bwpmYWN0b3JzLiBUaGlzIHdheSBSIGNhbiBhdXRvbWF0aWNhbGx5IGR1bW15IGNvZGUgdGhlbSB3aGVuIGJ1aWxkaW5nIG1vZGVscy4KCkZpcnN0IHRoaW5ncyBmaXJzdCwgd2UgaGF2ZSB0byBmaW5kIG91dCB3aGljaCBwYXJhbWV0ZXJzIGFyZSBzdHJpbmdzLCBzbyB3ZSBjYW4Ka25vdyB3aGljaCBvbmVzIHRvIGNvbnZlcnQgdG8gZmFjdG9yczoKYGBge3J9CmNoYXJfcGFybXMgPC0gY29sbmFtZXMoZGF0YVtzYXBwbHkoZGF0YSwgaXMuY2hhcmFjdGVyKV0pCmNoYXJfcGFybXMKYGBgCgpBcyB3ZSBjYW4gc2VlLCB3ZSBoYXZlIGEgYml0IG1vcmUgdGhhbiA0MCBwYXJhbWV0ZXJzIHdoaWNoIGFyZSBjaGFyYWN0ZXJzLgpMZXQncyBjb252ZXJ0IHRoZW0gdG8gZmFjdG9yczoKYGBge3J9CmRhdGFbY2hhcl9wYXJtc10gPC0gbGFwcGx5KGRhdGFbY2hhcl9wYXJtc10sIGZhY3RvcikKYGBgCgpMZXQncyBjaGVjayB3aGV0aGVyIHdlIHdlcmUgc3VjY2Vzc2Z1bDoKYGBge3J9CnN0cihkYXRhKQpgYGAKCldlIHdlcmUhICAKICAKTm93LCBhZnRlciBjbGVhbmluZyB0aGUgZGF0YSwgbGV0J3MgY2hlY2sgb3V0IHRoZSBjb3JyZWxhdGlvbnMsIHRvIHNlZSwgd2hpY2gKcGFyYW1ldGVycyBzaG91bGQgd2UgcGF5IG1vcmUgYXR0ZW50aW9uIHRvLiAgCkZpcnN0IGxldCdzIHNlZSBmb3IgdGhlIG51bWVyaWMgdmFsdWVzOgpgYGB7cn0KbnVtX3Bhcm1zIDwtIGNvbG5hbWVzKGRhdGFbc2FwcGx5KGRhdGEsIGlzLm51bWVyaWMpXSkKbnVtX3Bhcm1zCgpudW1jb3JzIDwtIGNvcihkYXRhWyxudW1fcGFybXNdLCB1c2UgPSAiY29tcGxldGUub2JzIikKY29ycnBsb3QobnVtY29ycywgdHlwZSA9ICJsb3dlciIpCmBgYApUaGlzIGlzIGhhcmQgdG8gcmVhZCwgYnV0IHdl
IGNhbiBhbHJlYWR5IHNlZSB0aGF0IHdlIGRvbid0IG5lZWQgYWxsIG9mIHRoZXNlIApwYXJhbWV0ZXJzLiAgCkxldCdzIGNoZWNrIG91dCB0aGUgbW9yZSByZWxldmFudCBvbmVzOgpgYGB7cn0KcmVsZXZhbnRfbmFtZXMgPC0gbmFtZXMobnVtY29yc1szOCxudW1jb3JzWzM4LF0gPiAwLjVdKQoKcmVsY29ycyA8LSBjb3IoZGF0YVsscmVsZXZhbnRfbmFtZXNdLCB1c2UgPSAiY29tcGxldGUub2JzIikKY29ycnBsb3QocmVsY29ycywgdHlwZSA9ICJsb3dlciIpCmBgYApMZXQncyBjaGVjayB0aGVtIG91dDoKYGBge3J9CnNjYXR0ZXJwbG90KGRhdGEkT3ZlcmFsbFF1YWwsIGRhdGEkU2FsZVByaWNlLCByZWdMaW5lID0gbGlzdChjb2w9ImdyZWVuIiksIHNtb290aD1saXN0KGNvbC5zbW9vdGg9InJlZCIsIGNvbC5zcHJlYWQ9ImJsYWNrIikpCgpzY2F0dGVycGxvdChkYXRhJEdyTGl2QXJlYSwgZGF0YSRTYWxlUHJpY2UsIHhsaW0gPSBjKDI1MCwgMzAwMCksIHlsaW0gPSBjKDAsIDUwMDAwMCksIHJlZ0xpbmUgPSBsaXN0KGNvbD0iZ3JlZW4iKSwgc21vb3RoPWxpc3QoY29sLnNtb290aD0icmVkIiwgY29sLnNwcmVhZD0iYmxhY2siKSkKCnNjYXR0ZXJwbG90KGRhdGEkR3JMaXZBcmVhLCBkYXRhJFRvdFJtc0FidkdyZCwgcmVnTGluZSA9IGxpc3QoY29sPSJncmVlbiIpLCBzbW9vdGg9bGlzdChjb2wuc21vb3RoPSJyZWQiLCBjb2wuc3ByZWFkPSJibGFjayIpKQoKc2NhdHRlcnBsb3QoZGF0YSRPdmVyYWxsUXVhbCwgZGF0YSRHckxpdkFyZWEsIHJlZ0xpbmUgPSBsaXN0KGNvbD0iZ3JlZW4iKSwgc21vb3RoPWxpc3QoY29sLnNtb290aD0icmVkIiwgY29sLnNwcmVhZD0iYmxhY2siKSkKCnBhaXJzKGRhdGFbLGMoIlNhbGVQcmljZSIsICJHckxpdkFyZWEiLCAiVG90Um1zQWJ2R3JkIildKQpgYGAKCldlIGNhbiBzZWUgYSBjb3VwbGUgb2Ygb2J2aW91cyBjb3JyZWxhdGlvbnMsIHRoYXQgZG9uJ3QgbWVhbiBhbnl0aGluZywgc3VjaCBhczoKYmV0d2VlbiB0aGUgR2FyYWdlQXJlYSBhbmQgR2FyYWdlQ2FycywgYW5kIEdyTGl2QXJlYSBhbmQgVG90Um1zQWJ2R3JkLgogIApMZXQncyBjYWxjdWxhdGUgYSBuZXcgcGFyYW1ldGVyLCB0aGUgcHJpY2UgcGVyIHNxdWFyZSBmZWV0OgpgYGB7cn0KZGF0YSRwcHNxZiA8LSBkYXRhJFNhbGVQcmljZSAvIGRhdGEkR3JMaXZBcmVhCnNjYXR0ZXJwbG90KGRhdGEkT3ZlcmFsbFF1YWwsIGRhdGEkcHBzcWYsIHJlZ0xpbmUgPSBsaXN0KGNvbD0iZ3JlZW4iKSwgc21vb3RoPWxpc3QoY29sLnNtb290aD0icmVkIiwgY29sLnNwcmVhZD0iYmxhY2siKSkKYGBgCkxldCdzIGNoZWNrIG91dCB0aGUgY29ycmVsYXRpb24gYmV0d2VlbiB0aGlzIG5ldyBwYXJhbWV0ZXJzIGFuZCB0aGUgb2xkIG9uZXM6CmBgYHtyfQpyZWxldmFudF9uYW1lczIgPC0gYyhyZWxldmFudF9uYW1lcywgInBwc3FmIikKcmVsZXZhbnRfbmFtZXMyCgpyZWxjb3JzMiA8LSBjb3IoZGF0YVsscmVsZXZhbnRfbmFtZXMyXSwgdXNlID0gImNvbXBs
ZXRlLm9icyIpCmNvcnJwbG90KHJlbGNvcnMyLCB0eXBlID0gImxvd2VyIikKYGBgCkl0IHNlZW1zIGFzIHRoZSBwcmljZSBwZXIgc3F1YXJlIGZlZXQgaGFzIHJpc2VuIG92ZXIgdGhlIHllYXJzLiBMZXQncyBmaW5kIGl0IG91dDoKYGBge3J9CnNjYXR0ZXJwbG90KGRhdGEkWWVhckJ1aWx0LCBkYXRhJHBwc3FmLCByZWdMaW5lID0gbGlzdChjb2w9ImdyZWVuIiksIHNtb290aD1saXN0KGNvbC5zbW9vdGg9InJlZCIsIGNvbC5zcHJlYWQ9ImJsYWNrIikpCmBgYApXZSB3ZXJlIHJpZ2h0LgoKCiMjIE1vZGVsIEJ1aWxkaW5nOgoKTGV0J3MgYnVpbGQgc29tZSBtb2RlbHMuIEkgd291bGQgbGlrZSB0byB1c2UgdGhlIGNhcmV0IHBhY2thZ2UgdG8gYnVpbGQgYSBsaW5lYXIKYW5kIGFuIGV4cG9uZW50aWFsIHJlZ3Jlc3Npb24gbW9kZWwuICAKTGV0J3MgY3JlYXRlIHRoZSBkYXRhIHBhcnRpdGlvbnMgZmlyc3Q6CmBgYHtyfQpsaWJyYXJ5KGNhcmV0KQoKdGFyZ2V0IDwtIGRhdGEkU2FsZVByaWNlCnRyYWluSWR4IDwtIGNyZWF0ZURhdGFQYXJ0aXRpb24odGFyZ2V0LCBwID0gLjc1KQp0cmFpbmRhdGEgPC0gZGF0YVt0cmFpbklkeCRSZXNhbXBsZTEsXQp0ZXN0ZGF0YSA8LSBkYXRhWy10cmFpbklkeCRSZXNhbXBsZTEsXQpgYGAKCmBgYHtyfQpzdHIodHJhaW5kYXRhKQpgYGAKCkFmdGVyIGNyZWF0aW5nIHRoZSBwYXJ0aXRpb25zLCBsZXQncyBidWlsZCB0aGUgbW9kZWwuICAKRmlyc3QgbGV0J3MganVzdCB1c2Ugb25lIHBhcmFtZXRlcjoKYGBge3J9Cm1vZGVsIDwtIGxtKFNhbGVQcmljZX5PdmVyYWxsUXVhbCwgZGF0YSA9IHRyYWluZGF0YSkKc3VtbWFyeShtb2RlbCkKYGBgCgoKCmBgYHtyfQpwbG90KG1vZGVsKQpzaGFwaXJvLnRlc3QobW9kZWwkcmVzaWR1YWxzKQpjb25maW50KG1vZGVsKQpjb3IodHJhaW5kYXRhJFNhbGVQcmljZSwgbW9kZWwkZml0dGVkLnZhbHVlcykKbW9kZWwKYGBgCmBgYHtyfQpwcmVkaWN0aW9uIDwtIHByZWRpY3QobW9kZWwsIHRlc3RkYXRhLCB0eXBlPSJyZXNwb25zZSIpCm1vZGVsX291dHB1dCA8LSBjYmluZCh0ZXN0ZGF0YSwgcHJlZGljdGlvbikKCm1vZGVsX291dHB1dCRsb2dfcHJlZGljdGlvbiA8LSBsb2cobW9kZWxfb3V0cHV0JHByZWRpY3Rpb24pCm1vZGVsX291dHB1dCRsb2dfU2FsZVByaWNlIDwtIGxvZyhtb2RlbF9vdXRwdXQkU2FsZVByaWNlKQoKcm1zZSA8LSBmdW5jdGlvbihmaXR0ZWR2YWxzLCB0cnVldmFscyl7CiAgc3FydChtZWFuKChmaXR0ZWR2YWxzIC0gdHJ1ZXZhbHMpXjIpKQp9CgpybXNlKG1vZGVsX291dHB1dCRsb2dfU2FsZVByaWNlLG1vZGVsX291dHB1dCRsb2dfcHJlZGljdGlvbikKYGBgCkFzIHdlIGNhbiBzZWUsIG91ciBtb2RlbCBpc24ndCBhbnkgZ29vZC4gTGV0J3MgdHJ5IGEgZGlmZmVyZW50IGFwcHJvYWNoLCB3aXRoIAptb3JlIHBhcmFtZXRlcnM6CmBgYHtyfQptb2RlbDIgPC0gbG0oU2FsZVByaWNlfk92ZXJhbGxRdWFsK0dyTGl2QXJlYSwgZGF0YSA9IHRyYWluZGF0
YSkKc3VtbWFyeShtb2RlbDIpCmBgYApMZXQncyBldmFsdWF0ZSBpdDoKYGBge3J9CnBsb3QobW9kZWwyKQpzaGFwaXJvLnRlc3QobW9kZWwyJHJlc2lkdWFscykKY29uZmludChtb2RlbDIpCmNvcih0cmFpbmRhdGEkU2FsZVByaWNlLCBtb2RlbDIkZml0dGVkLnZhbHVlcykKbW9kZWwyCmBgYApgYGB7cn0KcHJlZGljdGlvbjIgPC0gcHJlZGljdChtb2RlbDIsIHRlc3RkYXRhLCB0eXBlPSJyZXNwb25zZSIpCm1vZGVsMl9vdXRwdXQgPC0gY2JpbmQodGVzdGRhdGEsIHByZWRpY3Rpb24yKQoKbW9kZWwyX291dHB1dCRsb2dfcHJlZGljdGlvbiA8LSBsb2cobW9kZWwyX291dHB1dCRwcmVkaWN0aW9uKQptb2RlbDJfb3V0cHV0JGxvZ19TYWxlUHJpY2UgPC0gbG9nKG1vZGVsMl9vdXRwdXQkU2FsZVByaWNlKQoKcm1zZShtb2RlbDJfb3V0cHV0JGxvZ19TYWxlUHJpY2UsIG1vZGVsMl9vdXRwdXQkbG9nX3ByZWRpY3Rpb24pCmBgYAoKVGhpcyBzb21laG93IGFjdHVhbGx5IHdvcnNlbmVkIG91ciBtb2RlbCwgSSdtIG5vdCBleGFjdGx5IHN1cmUgd2h5LgpBbnl3YXksIGxldCdzIHRyeSB0byBnaXZlIGl0IG1vcmUgcGFyYW1ldGVyczoKCmBgYHtyfQptb2RlbDMgPC0gbG0oU2FsZVByaWNlfk92ZXJhbGxRdWFsK1llYXJCdWlsdCtZZWFyUmVtb2RBZGQrVG90YWxCc210U0YrWDFzdEZsclNGKwogICAgICAgICAgICAgICBHckxpdkFyZWErRnVsbEJhdGgrVG90Um1zQWJ2R3JkK0dhcmFnZUNhcnMrR2FyYWdlQXJlYStwcHNxZiwgZGF0YSA9IHRyYWluZGF0YSkKc3VtbWFyeShtb2RlbDMpCmBgYApMaW5lYXIgcmVncmVzc2lvbiBtb2RlbHMgYXJlIGNsZWFybHkgbm90IHRoZSB3YXkgdG8gZ28uIExldCdzIHRyeSBzb21lIGV4cG9uZW50aWFsCm9uZXMuCmBgYHtyfQptb2RlbDQgPC0gbG0obG9nKFNhbGVQcmljZSkgfiBPdmVyYWxsUXVhbCwgZGF0YSA9IHRyYWluZGF0YSkKc3VtbWFyeShtb2RlbDQpCmBgYAoKYGBge3J9CnBsb3QobW9kZWw0KQpzaGFwaXJvLnRlc3QobW9kZWw0JHJlc2lkdWFscykKY29uZmludChtb2RlbDQpCmNvcih0cmFpbmRhdGEkU2FsZVByaWNlLCBtb2RlbDQkZml0dGVkLnZhbHVlcykKbW9kZWw0CmBgYApBcyB3ZSBjYW4gc2VlLCB0aGlzIG1vZGVsIGlzIGZhaXJseSBiZXR0ZXIgdGhhbiBvdXIgcHJldmlvdXMgYXR0ZW1wdHMuCkxldCdzIHRyeSB0aGUgc2FtZSB0aGluZywgd2l0aCBtb3JlIHBhcmFtZXRlcnMuCgpgYGB7cn0KbW9kZWw1IDwtIGxtKGxvZyhTYWxlUHJpY2Upfk92ZXJhbGxRdWFsK1llYXJCdWlsdCtZZWFyUmVtb2RBZGQrVG90YWxCc210U0YrWDFzdEZsclNGKwogICAgICAgICAgICAgICBHckxpdkFyZWErRnVsbEJhdGgrVG90Um1zQWJ2R3JkK0dhcmFnZUNhcnMrR2FyYWdlQXJlYStwcHNxZiwgZGF0YSA9IHRyYWluZGF0YSkKc3VtbWFyeShtb2RlbDUpCmBgYAoKYGBge3J9CnBsb3QobW9kZWw1KQpzaGFwaXJvLnRlc3QobW9kZWw1JHJlc2lkdWFscykKY29uZmludChtb2RlbDUpCmNvcih0cmFpbmRhdGEk
U2FsZVByaWNlLCBtb2RlbDUkZml0dGVkLnZhbHVlcykKbW9kZWw1CmBgYAoKQXMgd2UgY2FuIHNlZSwgb3VyIG1vZGVsIGRpZCBub3QgaW1wcm92ZSwgb24gdGhlIG9wcG9zaXRlLCBpdCB3b3JzZW5lZC4KCkkgd291bGQgbG92ZSB0byBjb250aW51ZSB3b3JraW5nIG9uIHRoaXMgZXhlcmNpc2UsIGltcHJvdmluZyBteSBtb2RlbHMsIGV4YW1pbmluZwp0aGUgdGFyZ2V0IHBhcmFtZXRlciBpbiByZWxhdGlvbiB0byBncm91cHMgb2Ygb2JqZWN0cywgc3VjaCBhcyBjaXR5IGFyZWFzLCBvciAKdHlwZXMgb2YgYnVpbGRpbmdzLCBidXQgc2FkbHkgSSdtIG91dCBvZiB0aW1lLgoKQXMgYSBsYXN0IHRoaW5nLCBJIHRyaWVkIHRvIGNoZWNrIG91dCB0aGUgd2hldGhlciB0aGUgcHJpY2VzIGFuZCB0aGUgc2l6ZXMgb2YgCmxpdmluZyBhcmVhcyBhcmUgYWZmZWN0ZWQgYnkgdGhlIG5laWdoYm9yaG9vZC4gV2UgY2FuIG9ubHkgc2VlIHNvbWUgY29ycmVsYXRpb24KaW4gdGhlIG91dGxpZXJzLgoKYGBge3J9CnBsb3QoZGF0YSRHckxpdkFyZWEsIGRhdGEkU2FsZVByaWNlLCBjb2w9ZGF0YSROZWlnaGJvcmhvb2QpCmBgYAoKQXMgYSBjb25jbHVzaW9uLCBJIHN0aWxsIGhhdmUgYSBsb3QgdG8gbGVhcm4gYW5kIHdvdWxkIHJlcXVpcmUgYSBsb3QgbW9yZSB0aW1lIHRvCnByb3Blcmx5IHNvbHZlIHRoZSBwcm9ibGVtLiBNeSBjdXJyZW50IGJlc3QgbW9kZWwgb25seSB1c2VzIGEgc2luZ2xlIHBhcmFtZXRlci4KSWYgSSBoYXZlIHRoZSB0aW1lIGZvciBpdCBpbiB0aGUgZnV0dXJlLCBJJ2xsIHJldHVybiB0byB0cnkgdG8gc29sdmUgaXQgcHJvcGVybHkuCgoKCg==